Abstract

We consider the problem of using a factor model we call spike-and-slab sparse
coding (S3C) to learn features for a classification task. The S3C model resembles
both the spike-and-slab RBM (ssRBM) and sparse coding. Since exact inference in this
model is intractable, we derive a structured variational inference procedure and
employ a variational EM training algorithm. Prior work on approximate inference
for this model has not prioritized the ability to exploit parallel architectures and
scale to enormous problem sizes. We present an inference procedure appropriate
for use with GPUs, which allows us to dramatically increase both the training set
size and the number of latent factors.

We demonstrate that this approach improves upon the supervised learning capabilities
of both sparse coding and the ssRBM on the CIFAR-10 dataset. We evaluate
our approach's potential for semi-supervised learning on subsets of CIFAR-10.
We use our method to win the NIPS 2011 Workshop on Challenges In Learning
Hierarchical Models' Transfer Learning Challenge.



1 The S3C model 



The S3C model consists of latent binary spike variables $h \in \{0, 1\}^N$, latent real-valued slab
variables $s \in \mathbb{R}^N$, and a real-valued $D$-dimensional visible vector $v \in \mathbb{R}^D$,
generated according to this process: $\forall i \in \{1, \dots, N\},\ d \in \{1, \dots, D\}$,

$$p(h_i = 1) = \sigma(b_i), \quad p(s_i \mid h_i) = \mathcal{N}(s_i \mid h_i \mu_i, \alpha_{ii}^{-1}), \quad p(v_d \mid s, h) = \mathcal{N}(v_d \mid W_{d:}(h \circ s), \beta_{dd}^{-1}) \qquad (1)$$

where $\sigma$ is the logistic sigmoid function, $b$ is a set of biases on the spike variables, $\mu$ and $W$
govern the linear dependence of $s$ on $h$ and $v$ on $s$ respectively, $\alpha$ and $\beta$ are diagonal
precision matrices of their respective conditionals, and $h \circ s$ denotes the element-wise product of $h$ and $s$.

To avoid overparameterizing the distribution, we constrain the columns of $W$ to have unit norm, as
in sparse coding. We restrict $\alpha$ to be a diagonal matrix and $\beta$ to be a diagonal matrix or a scalar.
We refer to the variables $h_i$ and $s_i$ as jointly defining the $i$-th hidden unit, so that there are a
total of $N$ rather than $2N$ hidden units. The state of a hidden unit is best understood as $h_i s_i$,
that is, the spike variables gate the slab variables.
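As an illustration of the generative process in equation (1), the following is a minimal sketch of ancestral sampling using only the Python standard library. The function name `sample_s3c`, the nested-list layout of `W`, and the choice to pass `alpha` and `beta` as lists of diagonal entries are assumptions made for this example, not part of the original work.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_s3c(W, b, mu, alpha, beta, rng):
    """Draw one (h, s, v) sample following eq. (1).

    W is a D x N nested list; b, mu, alpha have length N; beta has length D.
    alpha and beta hold the diagonal entries of the precision matrices
    (an assumed representation for this sketch).
    """
    D, N = len(W), len(b)
    # Spike variables: h_i ~ Bernoulli(sigmoid(b_i)).
    h = [1 if rng.random() < sigmoid(b[i]) else 0 for i in range(N)]
    # Slab variables: s_i ~ N(h_i * mu_i, 1 / alpha_ii).
    s = [rng.gauss(h[i] * mu[i], alpha[i] ** -0.5) for i in range(N)]
    # Visible units: v_d ~ N(W_d: (h o s), 1 / beta_dd).
    v = []
    for d in range(D):
        mean = sum(W[d][i] * h[i] * s[i] for i in range(N))
        v.append(rng.gauss(mean, beta[d] ** -0.5))
    return h, s, v


# Toy usage: two hidden units with unit-norm columns, three visible units.
rng = random.Random(0)
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
h, s, v = sample_s3c(W, [-1.0, -1.0], [1.0, 1.0], [4.0, 4.0], [10.0] * 3, rng)
```

Note that $s_i$ is still sampled when $h_i = 0$, but it has no effect on $v$ because only the product $h_i s_i$ enters the mean of the visible units.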

In the subsequent sections we motivate our use of S3C as a feature discovery algorithm by describing
how this model occupies a middle ground between sparse coding and the ssRBM. The S3C
model avoids many disadvantages that the ssRBM and sparse coding have when applied as feature
discovery algorithms.

1.1 Comparison to sparse coding 



Sparse coding has been widely used to discover features for classification (Raina et al., 2007).
Recently, Coates and Ng (2011a) showed that this approach achieves excellent performance on the
CIFAR-10 object recognition dataset.

Sparse coding (Olshausen and Field, 1997) describes a class of generative models where the observed
data $v$ is normally distributed given a set of continuous latent variables $s$ and a dictionary
matrix $W$: $v \sim \mathcal{N}(Ws, \sigma I)$. Sparse coding places a factorial prior on $s$ such as a
Cauchy or Laplace distribution, chosen to encourage the posterior mode of $p(s \mid v)$ to be sparse.
One can derive the S3C model from sparse coding by replacing the factorial Cauchy or Laplace prior
with a spike-and-slab prior.

One drawback of sparse coding is that the latent variables are not merely encouraged to be sparse;
they are encouraged to remain close to 0, even when they are active. This kind of regularization is
not necessarily undesirable, but in the case of simple but popular priors such as the Laplace prior
(corresponding to an $L_1$ penalty on the latent variables $s$), the degree of regularization on active
units is confounded with the degree of sparsity. There is little reason to believe that in realistic
settings, these two types of complexity control should be so tightly bound together. The S3C model
avoids this issue by controlling the sparsity of units via the $b$ parameter that determines how likely
each spike unit is to be active, while separately controlling the magnitude of active units via the
$\mu$ and $\alpha$ parameters that govern the distribution over $s$. Sparse coding has no parameter
analogous to $\mu$ and cannot control these aspects of the posterior independently.
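The confounding described above can be seen in the simplest possible case. The sketch below assumes a single unit-norm dictionary column $w$, Gaussian noise with variance $\sigma$, and an $L_1$ penalty $\lambda |s|$; under these assumptions the MAP code is the soft-thresholding of $a = w^T v$ with threshold $t = \lambda \sigma$. The names `soft_threshold`, `a`, and `t` are illustrative and do not come from the original text.

```python
import math


def soft_threshold(a, t):
    """MAP estimate of a scalar code under an L1 penalty.

    a is the projection w^T v of the data onto a unit-norm basis vector;
    t is the threshold implied by the penalty strength and noise level.
    """
    return math.copysign(max(abs(a) - t, 0.0), a)


# The single threshold t plays two roles at once: it decides whether the
# unit is active, and it shrinks every active unit toward 0 by exactly t.
for a in (0.5, 1.5, 3.0):
    s = soft_threshold(a, t=1.0)
    print(a, "->", s, "(active)" if s != 0.0 else "(inactive)")
```

Raising `t` makes the code sparser and simultaneously shrinks every active unit, which is the coupling that S3C removes by using $b$ for sparsity and $\mu$, $\alpha$ for magnitude.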

Sparse coding is also difficult to integrate into a deep generative model of data such as natural
images. While Yu et al. (2011) and Zeiler et al. (2009) have recently shown some success at learning
hierarchical sparse coding, our goal is to integrate the feature extraction scheme into a proven
generative model framework such as the deep Boltzmann machine (DBM) (Salakhutdinov and Hinton, 2009).
Existing inference schemes known to work well in the DBM-type setting are all either sample-based
or are based on variational approximations to the model posteriors, while sparse coding schemes
typically employ MAP inference. Our use of variational inference makes the S3C framework well
suited to integrate into the known successful strategies for learning and inference in DBM models.
It is not obvious how one can apply a variational inference strategy to standard sparse coding with
the goal of achieving sparse feature encoding.

1.2 Comparison to Restricted Boltzmann Machines 

The S3C model also resembles another class of models commonly used for feature discovery: the
RBM. An RBM (Smolensky, 1986) is an energy-based model defined through an energy function
that describes the interactions between the observed data variables and a set of latent variables. It is
possible to interpret the S3C as an energy-based model, by rearranging $p(v, s, h)$ to take the form
$\exp\{-E(v, s, h)\}/Z$, with the following energy function:

$$E(v, s, h) = \frac{1}{2} \Big( v - \sum_i W_i s_i h_i \Big)^T \beta \Big( v - \sum_i W_i s_i h_i \Big) + \frac{1}{2} \sum_{i=1}^{N} \alpha_i (s_i - \mu_i h_i)^2 - \sum_{i=1}^{N} b_i h_i. \qquad (2)$$

The ssRBM model family is a good starting point for S3C because it has demonstrated both reasonable
performance as a feature discovery scheme and remarkable performance as a generative
model (Courville et al., 2011). Within the ssRBM family, S3C's closest relative is a variant of the
$\mu$-ssRBM, defined by the following energy function:

$$E(v, s, h) = -\sum_{i=1}^{N} v^T \beta W_i s_i h_i + \frac{1}{2} v^T \beta v + \frac{1}{2} \sum_{i=1}^{N} \alpha_i (s_i - \mu_i h_i)^2 - \sum_{i=1}^{N} b_i h_i, \qquad (3)$$

where the variables and parameters are defined identically to the S3C. Comparison of equations (2)
and (3) reveals that the simple addition of a latent factor interaction term
$\frac{1}{2} (h \circ s)^T W^T \beta W (h \circ s)$ to the ssRBM energy function turns the ssRBM into
the S3C model. With the inclusion of this term, S3C moves from an undirected ssRBM model to the
directed graphical model described in equation (1). This change from undirected modeling to directed
modeling has three important effects, which we describe in the following paragraphs.
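The relationship between equations (2) and (3) can be checked by expanding the quadratic form in equation (2). The short derivation below uses only the identity $\sum_i W_i s_i h_i = W (h \circ s)$ and the symmetry of the diagonal matrix $\beta$.

```latex
\begin{align*}
\tfrac{1}{2}\big(v - W(h \circ s)\big)^T \beta \big(v - W(h \circ s)\big)
  &= \tfrac{1}{2} v^T \beta v
     \;-\; v^T \beta W (h \circ s)
     \;+\; \tfrac{1}{2} (h \circ s)^T W^T \beta W (h \circ s) \\
  &= \tfrac{1}{2} v^T \beta v
     \;-\; \sum_{i=1}^{N} v^T \beta W_i s_i h_i
     \;+\; \tfrac{1}{2} (h \circ s)^T W^T \beta W (h \circ s).
\end{align*}
```

The first two terms on the right are exactly the data-dependent terms of equation (3); the third is the interaction term that couples pairs of hidden units.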

The effect on the partition function: The most immediate consequence of the transition to directed
modeling is that the partition function becomes tractable. This changes the nature of learning
algorithms that can be applied to the model, since most of the difficulty in training an RBM comes from
estimating the gradient of the log partition function. The partition function of S3C is also guaranteed
to exist for all possible settings of the model parameters, which is not true of the ssRBM.

The effect on the posterior: RBMs have a factorial posterior, but S3C and sparse coding have a
complicated posterior due to the "explaining away" effect. This means that for RBMs, features
defined by similar basis functions will have similar activations, while in directed models, similar
features will compete so that only the most relevant feature will remain active. As shown by Coates
and Ng (2011a), the sparse Gaussian RBM is not a very good feature extractor: the set of basis
functions $W$ learned by the RBM actually works better for supervised learning when these parameters
are plugged into a sparse coding model than when the RBM itself is used for feature extraction. We
think this is due to the factorial posterior. In the vastly overcomplete setting, being able to selectively
activate a small set of features likely provides S3C a major advantage in discriminative capability.

Figure 1: (Left) Q imposes a sparse distribution on $h$; $Q(h_i) < 0.01$ 91.8% of the time. The samples in this
histogram are values of $Q(h_i)$ for 1600 different hidden units from a trained model applied to 100 different
image patches. (Right) The inference procedure sparsifies the representation due to the explaining-away effect.
Q is initialized at the prior, which is very sparse. The data then drives Q to become less sparse, but subsequent
iterations make Q become sparse again.

The effect on the prior: The addition of the interaction term causes S3C to have a factorial prior. 
This probably makes it a poor generative model, but this is not a problem for the purpose of feature 
discovery. 

2 Other Related work 



The notion of a spike-and-slab prior was established in statistics by Mitchell and Beauchamp (1988).
Outside the context of unsupervised feature discovery for supervised, semi-supervised and self-taught
learning, the basic form of the S3C model (i.e. a spike-and-slab latent factor model) has
appeared a number of times in different domains (Lücke and Sheikh, 2011; Garrigues and Olshausen,
2008; Mohamed et al., 2011; Titsias and Lázaro-Gredilla, 2011). To this literature, we contribute an
inference scheme that scales to the kinds of object classification tasks that we consider. We outline
this inference scheme next.

3 Variational EM for S3C 

Having explained why S3C is a powerful model for unsupervised feature discovery, we turn to the
problem of how to perform learning and inference in this model. Because computing the exact
posterior distribution is intractable, we derive an efficient and effective inference mechanism and a
variational EM learning algorithm.



We turn to variational EM (Saul and Jordan, 1996) because this algorithm is well-suited for models
with latent variables whose posterior is intractable. It works by maximizing a variational lower
bound on the log-likelihood called the energy functional (Neal and Hinton, 1999). More specifically,
it is a variant of the EM algorithm (Dempster et al., 1977) with the modification that in the E-step,
we compute a variational approximation to the posterior rather than the posterior itself. While our
model admits a closed-form solution to the M-step, we found that online learning with small gradient
steps on the M-step objective worked better in practice. We therefore focus our presentation on the
E-step, given in Algorithm 1.

The goal of the variational E-step is to maximize the energy functional with respect to a distribution
Q over the unobserved variables. We can do this by selecting the Q that minimizes the
Kullback-Leibler divergence:

$$D_{KL}\big(Q(h, s) \,\|\, P(h, s \mid v)\big) \qquad (4)$$

where $Q(h, s)$ is drawn from a restricted family of distributions. This family can be chosen to ensure
that Q is tractable.
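The equivalence between maximizing the energy functional and minimizing the divergence in equation (4) follows from a standard identity, written out below. The symbols $\mathcal{L}$ for the energy functional and $H$ for the entropy are notation chosen here for illustration; the expectation sums over $h$ and integrates over $s$.

```latex
\begin{align*}
\log p(v) &= \mathcal{L}(v, Q) + D_{KL}\big(Q(h, s) \,\|\, P(h, s \mid v)\big), \\
\mathcal{L}(v, Q) &= \mathbb{E}_{Q}\big[\log p(v, h, s)\big] + H(Q).
\end{align*}
```

Because $\log p(v)$ does not depend on Q, any change to Q that lowers the KL divergence raises $\mathcal{L}(v, Q)$ by the same amount, and since the divergence is non-negative, $\mathcal{L}(v, Q)$ is a lower bound on the log-likelihood.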

Our E-step can be seen as analogous to the encoding step of the sparse coding algorithm. The key
difference is that while sparse coding approximates the true posterior with a MAP point estimate of
the latent variables, we approximate the true posterior with the distribution Q. We use the family

$$Q(h, s) = \prod_i Q(h_i, s_i).$$



Observing that minimizing eq. (4) over Q is a calculus of variations problem solved by the
Euler-Lagrange equation (Gelfand, 1963), we find that the solution must take the form

$$Q(h_i = 1) = \hat{h}_i, \qquad Q(s_i \mid h_i) = \mathcal{N}\big(s_i \mid h_i \hat{s}_i, (\alpha_i + h_i W_i^T \beta W_i)^{-1}\big) \qquad (5)$$

where $\hat{h}_i$ and $\hat{s}_i$ must be found by an iterative process. In a typical application of variational
inference, the iterative process consists of sequentially applying fixed point equations that give the
optimal value of the parameters $\hat{h}_i$ and $\hat{s}_i$ for one factor $Q(h_i, s_i)$ given the values of all of
the other factors' parameters. This is for example the approach taken by Titsias and Lázaro-Gredilla (2011),
who independently developed a variational inference procedure for the same problem. This process
is only guaranteed to decrease the KL divergence if applied to each factor sequentially, i.e. first
updating $\hat{h}_1$ and $\hat{s}_1$ to optimize $Q(h_1, s_1)$, then updating $\hat{h}_2$ and $\hat{s}_2$ to optimize
$Q(h_2, s_2)$, and so on. In a typical application of variational inference, the optimal values for each
update are simply given by the solutions to the Euler-Lagrange equations. For S3C, we make three
deviations from this standard approach.
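To make the standard fixed-point update concrete, the following is a sketch of how the individually optimal mean for one slab unit can be obtained from the energy function in equation (2). It is a derivation under the factorial assumption on Q, with every other unit $j \neq i$ replaced by its mean $\hat{h}_j \hat{s}_j$; the symbol $E_i$ is introduced here for illustration only.

```latex
\begin{align*}
E_i(s_i) &= -\, s_i\, W_i^T \beta \Big( v - \sum_{j \neq i} W_j \hat{h}_j \hat{s}_j \Big)
            + \tfrac{1}{2} s_i^2\, W_i^T \beta W_i
            + \tfrac{1}{2} \alpha_i (s_i - \mu_i)^2
            \quad \text{(terms of (2) involving } s_i \text{ when } h_i = 1\text{)}, \\
0 &= \frac{\partial E_i}{\partial s_i}
   \;\;\Longrightarrow\;\;
   \hat{s}_i^{*} = \frac{\alpha_i \mu_i + W_i^T \beta \big( v - \sum_{j \neq i} W_j \hat{h}_j \hat{s}_j \big)}
                        {\alpha_i + W_i^T \beta W_i}.
\end{align*}
```

The curvature of $E_i$ is $\alpha_i + W_i^T \beta W_i$, which agrees with the precision of $Q(s_i \mid h_i = 1)$ in equation (5). The sum over $j \neq i$ is what makes similar features compete: a unit that already explains part of $v$ reduces the residual seen by the others.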

Because we apply S3C to very large-scale problems, we need an algorithm that can fully exploit the
benefits of parallel hardware such as GPUs. Sequential updates across all $N$ factors require far too
much runtime to be competitive in this regime.

We propose a different method that enables parallel updates to all units. First, we partially minimize
the KL divergence with respect to $\hat{s}$. The terms of the KL divergence that depend on $\hat{s}$ make up a
quadratic function, so this can be minimized via conjugate gradient descent. We implement conjugate
gradient descent efficiently by using the R-operator to perform Hessian-vector products rather than
computing the entire Hessian explicitly (Schraudolph, 2002). This step is guaranteed to improve the
KL divergence on each iteration.

We next update $\hat{h}$ in parallel, shrinking the update by a damping coefficient. This approach is not
guaranteed to decrease the KL divergence on each iteration, but it is a widely applied approach that
works well in practice (Koller and Friedman, 2009).



In practice we find that we can obtain a faster algorithm that reaches equally good solutions by
replacing the conjugate gradient update to $\hat{s}$ with a more heuristic approach. We use a parallel
damped update on $\hat{s}$ much like what we do for $\hat{h}$. In this case we make an additional heuristic
modification to the update rule which is made necessary by the unbounded nature of $\hat{s}$. We clip
the update to $\hat{s}$ so that if $\hat{s}_{new}$ has the opposite sign from $\hat{s}$, its magnitude is at most
$\rho |\hat{s}|$. In all of our experiments we used $\rho = 0.5$ but any value in $[0, 1]$ is sensible. This
prevents a case where multiple mutually inhibitory $s$ units inhibit each other so strongly that rather
than being driven to 0 they change sign and actually increase in magnitude. This case is a failure mode
of the parallel updates that can result in $\hat{s}$ amplifying without bound if clipping is not used.
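The clipping and damping heuristics can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, vectors are plain lists, and "opposite sign" is interpreted as a strictly negative product, so a unit whose previous value is exactly 0 is never clipped (an assumption made here to avoid pinning such units at 0).

```python
import math


def clip_reflections(s_star, s_prev, rho=0.5):
    """Limit sign-flipping updates so that |new| is at most rho * |old|."""
    clipped = []
    for new, old in zip(s_star, s_prev):
        flips_sign = new * old < 0  # strictly opposite signs; zeros are left alone
        if flips_sign and abs(new) > rho * abs(old):
            new = math.copysign(rho * abs(old), new)
        clipped.append(new)
    return clipped


def damp(target, current, eta):
    """Move a fraction eta in (0, 1] of the way from current toward target."""
    return [eta * t + (1.0 - eta) * c for t, c in zip(target, current)]


# One damped, clipped step for three units (toy numbers).
s_prev = [1.0, 1.0, 1.0]
s_star = [-3.0, 2.0, -0.1]  # proposed individually optimal values
s_next = damp(clip_reflections(s_star, s_prev), s_prev, eta=0.5)
```

In the toy step, only the first unit is clipped: it tries to flip sign and triple in magnitude, which is the runaway pattern the heuristic is meant to stop. The third unit also flips sign but stays within the bound, so it passes through unchanged.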

We include some visualizations that demonstrate the effect of our E-step. Figure 1 (left) shows
that it produces a sparse representation. Figure 1 (right) shows that the explaining-away effect
incrementally makes the representation more sparse. Figure 2 (left) shows that the E-step increases the
energy functional.

4 Results 

We conducted experiments to evaluate the usefulness of S3C features for supervised learning and
semi-supervised learning on CIFAR-10 (Krizhevsky and Hinton, 2009), a dataset consisting of color
images of animals and vehicles. It contains ten labeled classes, with 5000 train and 1000 test
examples per class.



[Figure 2 plots omitted; recovered plot text: "Inference by Optimization", "CIFAR-10 Learning Curve",
"Damped parallel fixed point updates".]

Figure 2: (Left) The energy functional of a batch of 5000 image patches increases during the E-step. (Right)
Semi-supervised classification accuracy on CIFAR-10. In both cases the hyperparameters for the unsupervised
stage were optimized for performance on the full CIFAR-10 dataset, not re-optimized for each point on the
learning curve.

Algorithm 1 Fixed-Point Inference

Initialize $\hat{h}^{(0)} = \sigma(b)$ and $\hat{s}^{(0)} = \mu$.
for k = 0:K do
    Compute the individually optimal value $\hat{s}_i^*$ for each $i$ simultaneously:

    $$\hat{s}_i^* = \frac{\mu_i \alpha_{ii} + v^T \beta W_i - W_i^T \beta \sum_{j \neq i} W_j \hat{h}_j^{(k)} \hat{s}_j^{(k)}}{\alpha_{ii} + W_i^T \beta W_i}$$

    Clip reflections by assigning

    $$c_i = \rho \, \mathrm{sign}(\hat{s}_i^*) \, |\hat{s}_i^{(k)}|$$

    for all $i$ such that $\mathrm{sign}(\hat{s}_i^*) \neq \mathrm{sign}(\hat{s}_i^{(k)})$ and $|\hat{s}_i^*| > \rho |\hat{s}_i^{(k)}|$,
    and assigning $c_i = \hat{s}_i^*$ for all other $i$.
    Damp the updates by assigning

    $$\hat{s}^{(k+1)} = \eta c + (1 - \eta) \hat{s}^{(k)}$$

    where $\eta \in (0, 1]$.
    Compute the individually optimal values for $\hat{h}$:

    $$z_i = \Big( v - \sum_{j \neq i} W_j \hat{s}_j^{(k+1)} \hat{h}_j^{(k)} - \frac{1}{2} W_i \hat{s}_i^{(k+1)} \Big)^T \beta W_i \hat{s}_i^{(k+1)} + b_i - \frac{1}{2} \alpha_{ii} \big( \hat{s}_i^{(k+1)} - \mu_i \big)^2 - \frac{1}{2} \log(\alpha_{ii} + W_i^T \beta W_i) + \frac{1}{2} \log(\alpha_{ii})$$

    Damp the update to $\hat{h}$:

    $$\hat{h}^{(k+1)} = \eta \, \sigma(z) + (1 - \eta) \hat{h}^{(k)}$$
end for

For all experiments, we used the same procedure as Coates and Ng (2011a). CIFAR-10 consists of
32 x 32 images. We train our feature extractor on 6 x 6 contrast-normalized and ZCA-whitened
patches from the training set. At test time, we extract features from all 6 x 6 patches on an image, then
average-pool them. The average-pooling regions are arranged on a non-overlapping grid. Finally,
we train a linear SVM on the pooled features.



Coates and Ng (2011a) used 1600 basis vectors in all of their sparse coding experiments. They
post-processed the sparse coding feature vectors by splitting them into the positive and negative part for
a total of 3200 features per average-pooling region. They average-pool on a 2 x 2 grid for a total of
12,800 features per image. We used $E_Q[h]$ as our feature vector. This does not have a negative part,
so using a 2 x 2 grid we would have only 6,400 features. In order to compare with similar sizes of
feature vectors we used a 3 x 3 pooling grid for a total of 14,400 features.
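The feature counts above follow from simple arithmetic, which the sketch below reproduces alongside a minimal average-pooling routine. The function names and the nested-list layout are hypothetical, and the pooling code assumes the side of the feature map is divisible by the grid size; the source does not say how uneven regions are handled.

```python
def average_pool(features, grid):
    """Average-pool a square map of feature vectors over a grid x grid layout.

    features[r][c] is the feature vector for the patch at row r, column c.
    Regions are non-overlapping; the map side must be divisible by grid
    (a simplifying assumption of this sketch).
    """
    side = len(features)
    if side % grid != 0:
        raise ValueError("map side must be divisible by grid")
    step = side // grid
    dim = len(features[0][0])
    pooled = []
    for gr in range(grid):
        for gc in range(grid):
            total = [0.0] * dim
            for r in range(gr * step, (gr + 1) * step):
                for c in range(gc * step, (gc + 1) * step):
                    for k in range(dim):
                        total[k] += features[r][c][k]
            pooled.extend(x / (step * step) for x in total)
    return pooled


def features_per_image(num_basis, grid, split_sign=False):
    """Length of the pooled feature vector for one image."""
    per_region = num_basis * (2 if split_sign else 1)
    return per_region * grid * grid


print(features_per_image(1600, 2, split_sign=True))  # sparse coding, 2 x 2 grid
print(features_per_image(1600, 2))                   # S3C, 2 x 2 grid
print(features_per_image(1600, 3))                   # S3C, 3 x 3 grid
```

The three printed values correspond to the 12,800, 6,400, and 14,400 features quoted in the text.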

4.1 CIFAR-10 

On CIFAR-10, S3C achieves a test set accuracy of 78.3 ± 0.9% with 95% confidence (or 76.2 ± 0.9%
when using a 2 x 2 grid). Coates and Ng (2011a) do not report test set accuracy for sparse coding
with "natural encoding" (i.e., extracting features in a model whose parameters are all the same as in
the model used for training), but sparse coding with different parameters for feature extraction than
training achieves an accuracy of 78.8 ± 0.9% (Coates and Ng, 2011a). Since we have not enhanced
our performance by modifying parameters at feature extraction time, these results seem to indicate
that S3C is roughly equivalent to sparse coding for this classification task. S3C also outperforms
ssRBMs, which require 4,096 basis vectors per patch and a 3 x 3 pooling grid to achieve 76.7 ± 0.9%
accuracy. All of these approaches are close to the state of the art of 82.0 ± 0.8%, which used a
three-layer network (Coates and Ng, 2011b).

We also used CIFAR-10 to evaluate S3C's semi-supervised learning performance by training the
SVM on small subsets of the CIFAR-10 training set, but using features that were learned on the entire
CIFAR-10 train set. The results, summarized in Figure 2 (right), show that S3C is most advantageous
for medium amounts of labeled data. S3C features thus include an aspect of flexible regularization:
they improve generalization for smaller training sets yet do not cause underfitting on larger ones.

5 Transfer Learning Challenge 



For the NIPS 2011 Workshop on Challenges in Learning Hierarchical Models (Le et al., 2011),
the organizers proposed a transfer learning competition. This competition used a dataset consisting
of 32 x 32 color images, including 100,000 unlabeled examples, 50,000 labeled examples of 100
object classes not present in the test set, and 120 labeled examples of 10 object classes present in
the test set. The test set was not made public until after the competition. We chose to disregard the
50,000 labels and treat this as a semi-supervised learning task. We applied the same approach as on
CIFAR-10 and won the competition, with a test set accuracy of 48.6%.

6 Conclusion 

We have motivated the use of the S3C model for unsupervised feature discovery. We have described
a variational approximation scheme that makes it feasible to perform learning and inference
in large-scale S3C models. Finally, we have demonstrated that S3C is an effective feature discovery
algorithm for supervised, semi-supervised, and self-taught learning.